Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-17682: [Format] Add Bool8 Canonical Extension Type #43234

Merged
merged 5 commits into from
Aug 8, 2024

Conversation

joellubi
Copy link
Member

@joellubi joellubi commented Jul 12, 2024

Rationale for this change

Closes: #17682

Arrow Boolean arrays store values as individual bits, which is a very compact representation but does not match the layout of many systems with which it interoperates. By adding an 8-bit Boolean extension type, zero-copy compatibility with many systems can be improved at the cost of large physical representation.

Go implementation: #43323
C++ / Python implementation: #43488

What changes are included in this PR?

Proposal and documentation for Bool8 canonical extension type.

Are these changes tested?

N/A

Are there any user-facing changes?

N/A

Copy link

⚠️ GitHub issue #17682 has been automatically assigned in GitHub to PR creator.

"github.com/apache/arrow/go/v17/internal/json"
)

const ExtensionNameBool8 = "bool8"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

any particular reason for this to be exported?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no reason currently, but I did have a thought about how it could be used.

Right now we have the functions (Register|Unregister|Get)ExtensionType for interacting with the global extension registry which developers can use to opt in to any extensions their application recognizes. As we make a clearer distinction between regular extension types and canonical extension types, I was thinking that we could automatically pre-register extensions that are canonical so that applications automatically recognize them.

Whether we do this or not, the exported name key could make it easier to interact with the registry using a consistent identifier.

Thoughts?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, i think that's fine, not sure about automatically registering canonical types though. Does the C++ lib do that?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed this const entirely because it's only used in one place (in the Bool8Type.ExtensionName impl) and it's simple enough to just get the type itself, which is exported, and just call ExtensionName() if needed.

I'll also leave out auto-registering canonical extensions for now. Once Opaque merges and UUID gets moved out of internal, we can set that up for them together.

FYI all subsequent code changes are in #43323

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting committer review Awaiting committer review labels Jul 12, 2024

func (b *Bool8Type) Serialize() string { return ExtensionNameBool8 }

func (b *Bool8Type) String() string { return fmt.Sprintf("Bool8<storage=%s>", b.Storage) }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

convention is all lower case

I also think that convention is extension_type<storage=%s> for the String method

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like that's exactly what the ExtensionBase will fall back to if this is not implemented:

func (e *ExtensionBase) String() string { return fmt.Sprintf("extension_type<storage=%s>", e.Storage) }

Is there a convention for overriding this when the actual extension type is known, or should it just fall back to the default?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the best would be to check the C++ lib implementations and see what they do. Ideally keeping the string representations close would be great

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The convention in C++ (current precedent is just FixedShapeTensor for now) appears to be extension<EXTENSION_NAME[TYPE_PARAM_NAME=TYPE_PARAM_VALUE, ...]>.

For FixedShapeTensor that ends up looking like extension<arrow.fixed_shape_tensor[value_type=int32, shape=[2,2]]> for example.

So then I guess in this case it would be extension<arrow.bool8>?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I pushed this change to #43323

Comment on lines 351 to 355
case arrow.EXTENSION:
typ := dtype.(arrow.ExtensionType)
bldr := NewExtensionBuilder(mem, typ)
if custom, ok := typ.(ExtensionBuilderWrapper); ok {
return custom.NewBuilder(bldr)
if custom, ok := dtype.(CustomExtensionBuilder); ok {
return custom.NewBuilder(mem)
}
return bldr
return NewExtensionBuilder(mem, dtype.(arrow.ExtensionType))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you ensure this is being tested?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, good idea. I just added some additional tests comparing the behavior of an extension type with/without a custom builder to really make the distinction clear.

@github-actions github-actions bot added awaiting change review Awaiting change review awaiting changes Awaiting changes Component: Documentation and removed awaiting changes Awaiting changes awaiting change review Awaiting change review labels Jul 12, 2024
@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting change review Awaiting change review labels Jul 15, 2024
@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Jul 17, 2024
@github-actions github-actions bot added awaiting change review Awaiting change review awaiting changes Awaiting changes and removed awaiting changes Awaiting changes awaiting change review Awaiting change review labels Jul 25, 2024
Copy link
Member

@mapleFU mapleFU left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one LGTM!


* Description of the serialization:

No metadata is required to interpret the type. Any metadata present should be ignored.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMHO we should not leave any ambiguity here. Perhaps reuse the same wording as for JSON above?

Suggested change
No metadata is required to interpret the type. Any metadata present should be ignored.
Metadata is either an empty string or a JSON string with an empty object.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense to me. Any objection with just Metadata is an empty string?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, it would be fine with me.

@v1gnesh v1gnesh mentioned this pull request Aug 5, 2024
@chmp chmp mentioned this pull request Aug 7, 2024
4 tasks
@joellubi joellubi marked this pull request as ready for review August 8, 2024 17:46
@joellubi joellubi merged commit c537700 into apache:main Aug 8, 2024
7 checks passed
@joellubi joellubi removed the awaiting merge Awaiting merge label Aug 8, 2024
Copy link

After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit c537700.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 3 possible false positives for unstable benchmarks that are known to sometimes produce them.

joellubi added a commit that referenced this pull request Aug 12, 2024
### Rationale for this change

Go implementation of #43234

### What changes are included in this PR?

- Go implementation of the `Bool8` extension type
- Minor refactor of existing extension builder interfaces

### Are these changes tested?

Yes, unit tests and basic read/write benchmarks are included.

### Are there any user-facing changes?

- A new extension type is added
- Custom extension builders no longer need another builder created and released separately.

* GitHub Issue: #17682

Authored-by: Joel Lubinitsky <[email protected]>
Signed-off-by: Joel Lubinitsky <[email protected]>
felipecrv pushed a commit that referenced this pull request Aug 21, 2024
### Rationale for this change

C++ and Python implementations of #43234

### What changes are included in this PR?

- Implement C++ `Bool8Type`, `Bool8Array`, `Bool8Scalar`, and tests
- Implement Python bindings to C++, as well as zero-copy numpy conversion methods
- TODO: docs waiting for rebase on #43458

### Are these changes tested?

Yes

### Are there any user-facing changes?

Bool8 extension type will be available in C++ and Python libraries

* GitHub Issue: #17682

Authored-by: Joel Lubinitsky <[email protected]>
Signed-off-by: Felipe Oliveira Carvalho <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[C++] Add ExtensionType implementation for 8-bit boolean values
10 participants